MSDS453 - Research Assignment 01 Method 1 - First Vectorized Representation

  1. tokenization + normalization
  2. Tf-idf
  3. Word2vec and Doc2vec (100, 200, 300 dimensions)

Our goal in this exercise is to begin reaching a common agreement, as a class, on which terms to use as we selectively refine our corpus-wide vocabulary. This corpus vocabulary is what will represent the content of each document for clustering and classification purposes, which is our next step. That means we need to make decisions: what is in, and what is out.

Mount Google Drive to Colab Environment

Directories Required for Research Assignment:
1. Data Directory - Source Class Corpus Data
2. Output Directory - Vocabulary

Uncomment To Map Drive
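A minimal sketch of the mount step above, using the standard Colab `drive` helper; the directory paths are hypothetical placeholders for the two directories listed above.

```python
# Uncomment these lines when running in Colab to map Google Drive.
# from google.colab import drive
# drive.mount('/content/drive')

# Hypothetical example paths; adjust to your own Drive layout.
DATA_DIR = '/content/drive/MyDrive/MSDS453/data'      # source class corpus
OUTPUT_DIR = '/content/drive/MyDrive/MSDS453/output'  # vocabulary output
```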

NLTK Downloads
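A sketch of the download step; the specific resource names are assumptions based on the steps used later in the notebook (tokenization, stop word removal, lemmatization).

```python
import nltk

# Resources assumed for this pipeline: 'punkt' for tokenization,
# 'stopwords' for stop word removal, 'wordnet' for lemmatization.
for resource in ['punkt', 'stopwords', 'wordnet']:
    nltk.download(resource, quiet=True)
```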

Gensim is a Python library for topic modelling, document indexing, and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.

https://pypi.org/project/gensim/
Suppress warning messages

Utility Functions

Loading the corpus

  1. DataFrame = corpus_df
  2. List = documents (Doc_ID, Text)
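A sketch of the loading step producing both shapes listed above; the CSV path and the exact column names are assumptions based on the corpus index shown later.

```python
import pandas as pd

def load_corpus(path):
    """Load the class corpus as both a DataFrame and a list of
    (Doc_ID, Text) tuples. The path/file format is a hypothetical
    example; the real corpus lives in the Data Directory."""
    corpus_df = pd.read_csv(path)
    documents = list(zip(corpus_df['Doc_ID'], corpus_df['Text']))
    return corpus_df, documents
```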

Exploratory Data Analysis

Number of Reviews By Genre

The counts of reviews are balanced across the four genres of movies: action, comedy, horror, and sci-fi.
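The balance check above reduces to a one-line count per genre, assuming the `Genre of Movie` column from the corpus index shown later:

```python
import pandas as pd

def reviews_by_genre(corpus_df):
    # Count reviews per genre; a balanced corpus shows similar counts
    # across action, comedy, horror, and sci-fi.
    return corpus_df['Genre of Movie'].value_counts()
```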

Normalized Document

  1. remove_punctuation(text)
  2. lower_case(text)
  3. remove_tags(text)
  4. remove_special_chars_and_digits(text)
  5. return Document(document.doc_id, text)
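The normalization steps listed above can be sketched as follows. The helper names mirror the list; their bodies are assumptions, and a simple `Document` container stands in for whatever class the notebook defines. Tags are stripped first so their angle brackets are still intact when matched.

```python
import re
from collections import namedtuple

Document = namedtuple('Document', ['doc_id', 'text'])

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', ' ', text)

def lower_case(text):
    return text.lower()

def remove_tags(text):
    return re.sub(r'<[^>]+>', ' ', text)  # strip HTML-style tags

def remove_special_chars_and_digits(text):
    return re.sub(r'[^a-z\s]', ' ', text)

def normalize_document(document):
    # Strip tags before punctuation removal, then apply the
    # remaining steps from the list above.
    text = remove_tags(document.text)
    text = lower_case(text)
    text = remove_punctuation(text)
    text = remove_special_chars_and_digits(text)
    return Document(document.doc_id, text)
```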

Standardize Document

NLTK Tokenizer Package

https://www.nltk.org/api/nltk.tokenize.html

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
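For example, NLTK's Treebank tokenizer (which needs no extra data download, unlike `word_tokenize`, which requires the punkt resource) splits words and punctuation into separate tokens:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Words and punctuation marks come back as separate list items.
tokens = tokenizer.tokenize("The plot, frankly, was thin.")
```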

Functions for Tokenization Process

  1. tokenize_document
  2. tokenize_documents

Useful Lookups (Titles by DocID, Genres by DocID, Description by DocID)

index(['DSI_Title', 'Text', 'Submission File Name', 'Student Name', 'Genre of Movie', 'Review Type (pos or neg)', 'Movie Title', 'Descriptor', 'Doc_ID'], dtype='object')

Lookup for Specific Movie Title
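A sketch of the lookup helpers, assuming the column names in the corpus index shown above:

```python
import pandas as pd

def lookup_by_title(corpus_df, title):
    # Rows for a specific movie title.
    return corpus_df[corpus_df['Movie Title'] == title]

def genre_by_doc_id(corpus_df):
    # Map each Doc_ID to its genre for quick lookups.
    return dict(zip(corpus_df['Doc_ID'], corpus_df['Genre of Movie']))
```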

Qualitative Term Determinations

Terms Determined by Document of Interest

CountVectorizer

sklearn.feature_extraction.text.CountVectorizer:
Convert a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

TF-IDF Experiment Functions - Text Normalization

TF-IDF (Term Frequency-Inverse Document Frequency)

Experiments: Normalization, Tokenization, Lemmatization, and Stop Word Removal

Word2Vec

Word2vec embeddings: https://radimrehurek.com/gensim/models/word2vec.html
This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming and Pythonic interfaces.

Utility Functions For Word2Vec Experiments

Word2Vec Experiments:


Word2Vec Baseline Method 1 + 100

Utility functions for Doc2Vec experiments

Doc2vec Paragraph Embeddings: https://radimrehurek.com/gensim/models/doc2vec.html
Paragraph and document embeddings via the distributed memory and distributed bag of words models from Quoc Le and Tomas Mikolov: “Distributed Representations of Sentences and Documents”.
The algorithms use either hierarchical softmax or negative sampling.

Doc2Vec Experiments:


Word2vec and Doc2vec Method 1 + 200

Word2vec and Doc2vec Method 1 + 300